### Practical 1: Pre-processing of Text Document

Aim: To perform text pre-processing by removing stopwords and tokenizing a sentence.

Description: Stopwords such as "is", "the", and "in" are common words that carry little meaning. Removing them helps focus on the important words. Tokenization splits text into smaller units called tokens.

Code – Stopword Removal:

```python
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Load the English stopword list
stop_words = set(stopwords.words('english'))

# Display the stopwords
print(stop_words)
```

Code – Tokenization and Filtering:

```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

# Split the sentence into word tokens
word_tokens = word_tokenize(example_sent)

# Keep only tokens that are not stopwords (case-insensitive check)
filtered_sentence = []
for w in word_tokens:
    if w.lower() not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)
```

### Practical 2: Boolean Retrieval Model

Aim: To implement the Boolean Retrieval Model for given documents.

Description: Documents are represented as sets of words. Boolean operators used:
- AND ( ∧ ) → both terms must be present
- OR ( ∨ ) → at least one term must be present
- NOT ( ¬ ) → excludes documents containing the term

Code:

```python
import re

docs = {
    1: "Information Retrieval has 2 models and information.",
    2: "Boolean is a basic Information Retrieval classic model.",
    3: "Information is a data that processed, Information.",
    4: "When a Data Processed the result is Information, Data."
}

# All document IDs
all_docs = set(docs.keys())

# Check whether a term occurs in a document
def has_term(doc, term):
    words = set(re.findall(r'\w+', doc.lower()))
    return term in words

# Collect the documents containing each term
data_docs = set(i for i, doc in docs.items() if has_term(doc, "data"))
info_docs = set(i for i, doc in docs.items() if has_term(doc, "information"))
retrieval_docs = set(i for i, doc in docs.items() if has_term(doc, "retrieval"))

# Evaluate the query: (Data AND Information) OR (NOT Retrieval)
result = (data_docs & info_docs) | (all_docs - retrieval_docs)

print("Result for Query: (Data ^ Information) v (~ Retrieval)\n")
for i in sorted(result):
    print(f"Doc{i}:", docs[i])
```

### Practical 3: Vector Space Model

Aim: To implement the Vector Space Model using cosine similarity.

Description: Documents and queries are represented as vectors. Cosine similarity measures how similar two vectors are: 1 → very similar, 0 → not similar.
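The similarity computed in the code below is the standard cosine similarity; for a document vector d and a query vector q:

$$
\cos(d, q) = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert} = \frac{\sum_i d_i q_i}{\sqrt{\sum_i d_i^2}\,\sqrt{\sum_i q_i^2}}
$$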
Code:

```python
import math, collections

docs = ["A man and a woman.", "A baby."]

# Build the vocabulary from all documents (lowercased, punctuation stripped)
vocab = sorted(set(w.lower().strip('.,') for d in docs for w in d.split()))

# Turn a piece of text into a term-frequency vector over the vocabulary
def vectorize(text):
    c = collections.Counter(w.lower().strip('.,') for w in text.split())
    return [c[t] for t in vocab]

# Cosine similarity between two vectors
def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b) if mag_a and mag_b else 0

query = "woman"
q_vec = vectorize(query)

for doc_num, doc in enumerate(docs, start=1):
    sim = cosine_sim(vectorize(doc), q_vec)
    print("Doc", doc_num, "similarity:", round(sim, 3))
```

### Practical 4: Web Spamming Detection

Aim: To detect keyword stuffing and web spam.

Description: Web spamming uses repeated keywords or hidden content to manipulate search rankings. Counting word frequencies helps detect potential spam.

Code:

```python
from collections import Counter
import re

web_content = """Cheap watches available now! Best cheap watches for you. Buy cheap watches online.
Cheap cheap cheap watches watches!"""

# Extract words (lowercase, ignore punctuation)
words = re.findall(r'\b\w+\b', web_content.lower())

# Count word frequencies
keyword_counts = Counter(words)

# Threshold for spam detection
SPAM_THRESHOLD = 4

# Print frequencies
print("Keyword Frequencies:")
for word, count in keyword_counts.items():
    print(f"{word}: {count}")

# Detect potential spam keywords
print("\nPotential Spam Keywords:")
for word, count in keyword_counts.items():
    if count >= SPAM_THRESHOLD:
        print(f"'{word}' appears {count} times (possible keyword stuffing)")
```

### Practical 5: Text Summarization

Aim: To implement extractive and abstractive text summarization.

Description: Summarization reduces a long text into a shorter version.
- Extractive → selects important sentences from the original text
- Abstractive → generates new sentences

Code:

```python
# Install the necessary packages (for Jupyter/Colab)
!pip install sumy transformers torch

import nltk
nltk.download('punkt_tab')

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from transformers import pipeline

# Sample text
text = """
Artificial intelligence (AI) is the technology that allows machines to simulate human intelligence,
enabling them to learn, reason, problem-solve, and make decisions. AI systems achieve this by
analyzing vast amounts of data to identify patterns, understand language, and recognize objects,
similar to how humans think and behave. Key applications of AI include natural language processing,
computer vision, and autonomous systems, impacting various industries by automating tasks and
improving decision-making. The four common types of Artificial Intelligence (AI), based on their
functionality, are: Reactive Machines, Limited Memory, Theory of Mind, and Self-aware AI. Reactive
machines, like the IBM Deep Blue chess program, act on the present situation but don't store
memories. Limited Memory AI, such as self-driving cars, can use past data to inform current
decisions. Theory of Mind and Self-aware AI are currently conceptual stages of AI that would
possess human-like understanding of emotions and consciousness, respectively.
"""

# Extractive summarization: pick the most central sentences with LexRank
def extractive_summary(text, num_sentences=3):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    summary = summarizer(parser.document, num_sentences)
    return ' '.join(str(sentence) for sentence in summary)

# Abstractive summarization: generate a new summary with a transformer model
def abstractive_summary(text):
    summarizer = pipeline("summarization")
    summary = summarizer(text, max_length=100, min_length=25, do_sample=False)
    return summary[0]['summary_text']

# Run both summarizations
print("===== Abstractive Summary =====")
print(abstractive_summary(text))

print("\n===== Extractive Summary =====")
print(extractive_summary(text))
```

### Practical 6: Inverted Index

Aim: To create an inverted index for document search.

Description: An inverted index maps each word to the documents in which it occurs. It allows fast lookup for search engines.

Code:

```python
import re
from collections import defaultdict

# Build an inverted index: term -> set of document IDs
def create_inverted_index(documents):
    inverted_index = defaultdict(set)
    for doc_id, document in enumerate(documents):
        # Normalize: lowercase and strip punctuation so "Document" and "document." map to the same term
        for word in re.findall(r'\w+', document.lower()):
            inverted_index[word].add(doc_id)
    return inverted_index

# Sample documents
documents = [
    "This is the first document.",
    "Second document is here.",
    "And this is the third document."
]

# Create the inverted index
inverted_index = create_inverted_index(documents)

# Display the inverted index
print(dict(inverted_index))
```
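Querying the index is then a matter of looking up each query term's posting set and intersecting the sets. The sketch below continues from the code above; the helper name `search_and` is only illustrative and not part of the practical:

```python
# AND query: return the IDs of documents containing every query term
def search_and(inverted_index, query):
    terms = re.findall(r'\w+', query.lower())
    if not terms:
        return set()
    # Start from the postings of the first term and intersect with the rest
    result = set(inverted_index.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted_index.get(term, set())
    return result

print(search_and(inverted_index, "third document"))  # {2}
```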
### Practical 7: Opinion Spam Detection

Aim: To identify spam reviews in user feedback.

Description: Spam reviews use promotional phrases like "Buy now", "Limited offer", or "Click here". Checking reviews for these keywords helps flag spam.

Code:

```python
# List of sample reviews
reviews = [
    "This phone is amazing, battery lasts all day!",
    "Worst phone ever, waste of money!",
    "Buy this product now!!! Limited offer!!!",
    "Great value for money, highly recommended!"
]

# List of spam keywords
spam_words = ["buy now", "limited offer", "click here", "free"]

# Flag a review as spam if it contains any spam keyword
for r in reviews:
    if any(word in r.lower() for word in spam_words):
        print("SPAM:", r)
    else:
        print("GENUINE:", r)
```
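Note that the substring check above can also match a keyword inside a longer word (for example, "free" inside "carefree"). A minimal variant using word-boundary regular expressions avoids this; it reuses `reviews` and `spam_words` from the code above, and the function name `is_spam` is only illustrative:

```python
import re

# Flag a review only when a spam phrase appears as whole words
def is_spam(review, spam_words):
    text = review.lower()
    return any(re.search(r'\b' + re.escape(phrase) + r'\b', text) for phrase in spam_words)

for r in reviews:
    print("SPAM:" if is_spam(r, spam_words) else "GENUINE:", r)
```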